Deep learning has effectively solved complicated challenges ranging from
large-scale data analytics to human-level control and computer vision. However,
deep learning has also been used to produce software that threatens privacy,
democracy, and national security. Deepfakes are one of these new applications
backed by deep learning. Fake images and videos created by deepfake
algorithms can be difficult for people to tell apart from real ones. This
necessitates the development of tools that can automatically detect and
evaluate the integrity of digital visual media. This paper provides an overview
of the algorithms and datasets used to build deepfakes, as well as the
approaches presented to date for detecting them. By reviewing the
background of deepfake methods, this paper provides a comprehensive overview
of deepfake approaches and promotes the creation of new and more robust
strategies to deal with increasingly complex deepfakes.

1. INTRODUCTION

Techniques known as "deepfakes", a name derived from the words "deep learning" and "fake", make it
possible to generate videos in which a target person appears to do or say things that a source person
actually did or said. This type of deepfake is known as a "face-swap". The term is broad, however: depending
on how the content is generated using artificial intelligence, deepfakes can also be lip-syncs or
puppet-masters [1], [2]. A lip-sync deepfake is a video whose lip movements have been synchronized to an
audio recording. A puppet-master deepfake contains footage of a target individual (i.e., the puppet) animated
to follow the facial expressions and the eye and head movements of another person seated in front of a video
camera [3], [4]. While traditional visual effects and computer graphics can be used to make certain deepfakes,
deep learning models such as generative adversarial networks (GANs) and auto-encoders, which have been
widely used in computer vision, are now the usual underlying mechanism for deepfake generation [5].
Figure 1 depicts a typical GAN model, which includes two neural networks: a generator and a
discriminator. By analyzing a person's facial motions, these models help to synthesize images of another
person with similar expressions and movements [6], [7].
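For reference, the interplay between the generator and the discriminator is commonly written as the minimax objective of the original GAN formulation. The symbols used here (G, D, x, z, p_data, p_z) are notation assumed for illustration and are not defined elsewhere in this paper.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

In this notation, D(x) is the discriminator's estimate of the probability that x is real, and G(z) is a sample generated from noise z. The discriminator tries to increase V, while the generator tries to decrease it.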

To train models that produce photo-realistic images and videos, deepfake approaches often require a
large quantity of image and video data. Aside from generating realistic digital persons, deepfakes are used in
visual effects, Snapchat filters, digital avatars, creating voices for those who have lost theirs, and
updating movies without reshooting them [8]. As depicted in Figure 2, deepfake detection is divided into two
key categories: fake video detection techniques and fake image detection techniques [5]. Discovering the
truth in the digital world has become increasingly crucial. It is considerably more difficult when dealing
with deepfakes, as they are predominantly used for harmful purposes and virtually anybody can create them
with the deepfake tools available today.

Figure 1. The architecture of GAN

Figure 2. Categories of deepfake detection techniques


Face swapping between source and destination images with an autoencoder structure requires two
encoder-decoder pairs. Each pair is trained on a different image set, but the encoder's parameters are
shared, as shown in Figure 3. In other words, the encoder network is identical between the two pairs. Faces
typically share features such as the positions of the eyes, nose, and mouth, which makes it easy for the
common encoder to detect and learn the similarities between the two sets of face images [7]. To swap a face,
an image from the source set is passed through the shared encoder and then reconstructed with the decoder
trained on the destination set.
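The shared-encoder arrangement can be sketched with a deliberately tiny linear model. This is an illustrative toy under stated assumptions, not the architecture of any particular deepfake tool: the 4-element vectors stand in for face images, and the names `encoder`, `decoders`, `train_step`, and `reconstruct` are hypothetical.

```python
import random

random.seed(0)
DIM, LATENT = 4, 2  # toy "image" and latent sizes (illustrative assumptions)


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# One encoder shared by both identities, plus one decoder per identity.
encoder = rand_matrix(LATENT, DIM)
decoders = {"A": rand_matrix(DIM, LATENT), "B": rand_matrix(DIM, LATENT)}


def reconstruct(x, identity):
    """Encode with the shared encoder, decode with the chosen identity's decoder."""
    return matvec(decoders[identity], matvec(encoder, x))


def loss(x, identity):
    """Squared reconstruction error for one sample."""
    return sum((y - t) ** 2 for y, t in zip(reconstruct(x, identity), x))


def train_step(x, identity, lr=0.02):
    """One gradient-descent step on the squared reconstruction error."""
    dec = decoders[identity]
    z = matvec(encoder, x)
    residual = [y - t for y, t in zip(matvec(dec, z), x)]
    # Gradient with respect to the latent code, computed before dec is updated.
    grad_z = [sum(dec[i][j] * residual[i] for i in range(DIM)) for j in range(LATENT)]
    for i in range(DIM):
        for j in range(LATENT):
            dec[i][j] -= lr * 2 * residual[i] * z[j]
    for j in range(LATENT):
        for i in range(DIM):
            encoder[j][i] -= lr * 2 * grad_z[j] * x[i]


# Toy data: both sets share the first two features and differ in the rest.
faces_a = [[s, s, t, 0.0] for s in (0.2, 0.6, 1.0) for t in (0.3, 0.9)]
faces_b = [[s, s, 0.0, t] for s in (0.2, 0.6, 1.0) for t in (0.3, 0.9)]


def total_loss():
    return sum(loss(x, "A") for x in faces_a) + sum(loss(x, "B") for x in faces_b)


initial_loss = total_loss()
for _ in range(2000):
    for x_a, x_b in zip(faces_a, faces_b):
        train_step(x_a, "A")  # updates the shared encoder and decoder A
        train_step(x_b, "B")  # updates the shared encoder and decoder B
final_loss = total_loss()

# Face swap: encode a sample from set A, then decode it with decoder B.
swapped = reconstruct(faces_a[0], "B")
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because every training step updates the same `encoder`, it has to find a latent code that works for both sets, while each decoder specializes in one identity. Real systems replace the linear maps with deep convolutional networks.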

Various approaches have been proposed to identify deepfakes [9]. Most of them are themselves based
on deep learning, which has led to a contest between malicious and beneficial applications of deep learning
techniques. In response to the threat of deepfakes and face-swapping technology, the Defense Advanced
Research Projects Agency (DARPA) established a research program in media forensics (called MediFor, short
for Media Forensics) to speed up the development of technologies that detect fraudulent digital visual
media [10]. For illustration, some example face swaps are shown in Figure 4. According to the Dimensions
scholarly database, the number of deepfake publications has risen dramatically over the past few years, as
seen in Figure 5 [11].

Figure 4. Face swapping example

Figure 5. Number of deepfake publications from 2016 to 2021


2. DEEPFAKE CREATION AND DETECTION METHODS 

Deepfakes have gained popularity as a result of the high quality of their altered videos and the
accessibility of their applications to people with varying levels of computer expertise, from experts to
beginners. These applications are mostly built with deep learning techniques [12]. Consequently, many
computer vision researchers have taken up the deepfake detection problem. For example, Chang et al. [13]
developed a deepfake face image detection method, NA-VGG, using an improved VGG (visual geometry
group) network based on image noise and augmentation. An SRM (spatial rich model) filter layer is used to
highlight image noise, and the resulting noise map is augmented to weaken face features. Finally, the
augmented noise images are fed to the network to train it and to detect fraudulent photos. On the Celeb-DF
dataset, NA-VGG outperformed existing fake image detectors. Shad et al. [14] implemented several methods
for identifying deepfake images and compared them. In this study, eight convolutional neural network (CNN)
architectures are used to detect deepfake images in a large dataset, providing a comparison of how well
CNNs can distinguish between real and deepfake images.

Zhou et al. [15] proposed a two-stream network to detect face manipulation. In one stream,
GoogLeNet is trained to detect tampering artifacts in a face classification stream, while the second stream
uses a patch-based approach to capture local noise residual features. To evaluate the method, two online
face-swap apps were used to modify 2010 images, resulting in a new dataset of 2010 tampered photos. This
dataset was then used to assess the proposed two-stream network. Compared with previous methods, the
approach is effective because it learns both tampering artifacts and hidden noise residual features.

Wodajo and Atnafu [16] designed and developed a generalized deepfake video detection model using
a convolutional neural network (CNN) and a transformer. The convolutional vision transformer has two parts:
a CNN and a vision transformer (ViT). The CNN extracts learnable features, whereas the ViT uses an
attention mechanism to classify the features it receives. Trained on the DFDC dataset, the model obtained
91.5% accuracy, a loss value of 0.32, and an AUC of 0.91.

In 2019, Zhang and Zhao [17] proposed a new deep learning-based method for distinguishing
AI-generated face photos from real-world facial images. The model combines deep learning with error level
analysis (ELA) detection. It has several advantages over existing models, such as a shorter training period,
fewer layers, and greater efficiency.

Li and Lyu [3] explored a deep learning-based method for distinguishing AI-generated fake videos,
referred to as deepfake videos, from real ones. The method is based on the observation that current deepfake
algorithms can only generate images of limited resolution, which then need to be warped further to match
the faces being replaced in the source video. These transforms leave distinctive artifacts that a detector
can learn. The method was evaluated on several sets of deepfake videos, which demonstrated its practical
viability.

Mo et al. [18] developed a CNN-based algorithm for detecting fake facial images and provided
extensive experimental results showing that the proposed algorithm can discriminate between fake and real
facial photos with an average accuracy of over 99.4%. In addition, while current GAN-based techniques can
generate realistic-looking faces (or other visual objects and scenes), they inevitably leave statistical
artifacts that reveal the images as fakes.

Hsu et al. [19] proposed a deep forgery discriminator (DeepFD) that embeds a contrastive loss to
efficiently detect fake images generated by modern GANs. The proposed technique is described as the first to
detect such GAN-generated photos. The key contribution is that the contrastive loss can capture the joint
discriminative features of fake images produced by different GANs. It also improves classification
performance and makes it possible to visualize unrealistic details in fake photos. Experiments show that
DeepFD detected 94.7% of the fake images generated by advanced GANs. Kolagati et al. [20] constructed a
deep hybrid neural network model to detect deepfake videos. Facial landmark detection is used to extract
information on a wide range of facial characteristics from the videos. This data is then fed into a
multilayer perceptron (MLP) to learn the difference between real and fake videos. Table 1 summarizes the
most relevant research in the field of deepfake detection.
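To make the idea of a contrastive loss concrete, the sketch below implements one common pairwise form of it. It is a generic illustration and not the exact loss used in DeepFD [19]: the function name `contrastive_loss`, the `margin` value, and the two-dimensional embeddings are assumptions chosen for the example.

```python
import math


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise contrastive loss on two embedding vectors.

    Pairs from the same class (e.g., fake/fake) are pulled together, while
    pairs from different classes (real/fake) are pushed at least `margin` apart.
    """
    distance = math.dist(emb_a, emb_b)  # Euclidean distance between embeddings
    if same_class:
        return 0.5 * distance ** 2
    return 0.5 * max(margin - distance, 0.0) ** 2


# Hypothetical embeddings of two fake images and one real image.
fake_1, fake_2, real_1 = [0.1, 0.2], [0.2, 0.1], [0.9, 0.8]
print(contrastive_loss(fake_1, fake_2, same_class=True))
print(contrastive_loss(fake_1, real_1, same_class=False))
```

Minimizing such a loss over many pairs encourages the network to place fake images from different generators close together in feature space, which matches the intuition behind learning joint discriminative features.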


Table 1. Comparison of the relevant studies

| Reference | Method | Year | Advantages | Performance evaluation: accuracy (%) |
|---|---|---|---|---|
| Chang et al. [13] | Convolutional neural network | 2020 | NA-VGG improved the detection of deepfake face images, and the accuracy of this method is much higher than that of several deepfake detection models. | 85.70 |
| Zhou et al. [15] | Neural networks | 2017 | It can detect tampering artifacts as well as hidden noise residual features. This method outperforms each individual stream by a large margin. | 92.70 |
| Wodajo and Atnafu [16] | Convolutional vision transformer | 2021 | Detects deepfakes and quickly determines whether the images are real or not. | 91.50 |
| Shad et al. [14] | Convolutional neural network | 2021 | Detects deepfake images with high accuracy. Accuracy, precision, F1-score, and area under the ROC curve were all highest for VGGFace. | 99.00 |
| Ismail et al. [21] | XGBoost | 201[?] | The XGBoost algorithm uses more precise approximations to find the optimal tree model. It is designed to be adaptable and quick. It presents a fast and precise parallel tree boosting that solves many data science problems. | [illegible] |
| Ahmed et al. [22] | Rationale-augmented convolutional neural network | 2021 | In a real-time environment, models that have better performance and are smaller in size will be more useful. | 95.77 |
| Rossler et al. [23] | XceptionNet | 2019 | Pre-training on ImageNet and larger network capacity allow XceptionNet to achieve compelling results on low-quality images while maintaining reasonable performance. | 95.73 |
| Zhang and Zhao [17] | Deep learning and ELA detection | 2019 | Fewer layers, less training time, more efficiency. | 97.00 |
| Li and Lyu [3] | Transforms leave distinctive artifacts | 2019 | Simple image processing operations on an image can simulate artifacts directly. | [illegible] |
| Khalid and Woo [24] | One-class variational auto-encoder | 2020 | This method reconstructs real face images better than other methods. This shows that a one-class approach can effectively distinguish real (normal) images from anomalous (abnormal) ones. | 98.20 |
| Liu et al. [25] | 3D convolutional neural network | 2021 | The proposed network has fewer parameters than other networks and reduces deployment consumption while maintaining detection performance. | 99.83 |
| Schroff et al. [26] | FaceNet: unified embedding | 2015 | A significant increase in the efficiency of representation. | 99.63 |
| Parkhi et al. [27] | Convolutional neural network | 2015 | This method provides the best performance and can be applied to a wide range of other tasks. | 98.95 |
| Guera and Delp [28] | Recurrent neural network | 2018 | With only 2 seconds of video data, this algorithm can accurately predict whether a video has been manipulated. | 97.10 |
| Hsu et al. [19] | Generative adversarial network | 2018 | In terms of precision and recall rate, this approach outperforms other baseline approaches. | 94.70 |
| Marra et al. [29] | Generative adversarial network | 2018 | Deep networks, especially Xception-Net, are more robust and work well even under training-test mismatch. | 89.00 |
| Mo et al. [18] | Convolutional neural network | 2018 | A high visual quality fake face image can be distinguished from a real one using this method, which is effective in many situations. | 99.40 |
| Dang et al. [30] | Convolutional neural network | [illegible] | The proposed system automatically extracts many abstract features, overcoming many challenges, and the model performed well on the dataset's imbalanced scenario. | 98.00 |
| Kolagati et al. [20] | Deep multilayer perceptron and convolutional neural network | 2022 | The hybrid system is ideal for screening deepfake videos with high speed and low computational resources. | 84.00 |
| Khodabakhsh et al. [31] | Convolutional neural network | 2018 | Best results by a wide margin. Stable decision points are confirmed by lower error rates in conjunction with a lower EER. | 99.60 |

Using cutting-edge artificial intelligence techniques, a developer created software that could
replace one person's face with another's. Deepfakes became popular in early 2018. For the procedure to work,
a computer was fed a large number of still images of one individual and video footage of another. The
software then created a new (i.e., fake) video with matching expressions, lip-sync, and other motions [12].
Table 2 provides an overview of deepfake tools and their features.


Table 2. List of the most popular deepfake tools

| Reference | Tool | Features |
|---|---|---|
| [32] | DeepFaceLab | Multiple methods of face extraction are supported. |
| [10] | Faceswap-GAN | Auto-encoder architecture with adversarial and perceptual losses. |
| [10] | Faceswap | Uses two encoder-decoder pairs. |
| [33] | Few-Shot Face Translation | Extracts latent embeddings for GAN processing using a pre-trained face recognition model. |
| [34] | DFaker | A loss function called DSSIM is used to reconstruct a face. |
| [34] | Deepfaketf | Similar to DFaker but built on TensorFlow. |
| [35] | AvatarMe | Creates 3D faces from arbitrary "in-the-wild" images. |
| [36] | StyleRig | Trained in a self-supervised manner, so annotations are not required. |
| [37] | MarioNETte | Identity adaptation does not require a further fine-tuning process. |
| [17] | DiscoFaceGAN | Adopts 3D priors in adversarial learning. |
| [13] | StyleGAN | In the new architecture, high-level attributes are separated automatically and without supervision. |
| [12] | Face2Face | Real-time facial reenactment of a monocular target video. |
| [36] | Neural Textures | Feature maps learned during scene capture and stored on top of 3D mesh proxies. |
| [38] | Transformable Bottleneck Networks | Fine-grained 3D image modification. |
| [39] | Neural voice puppetry | Synthesis of audio-driven facial video. |


3. EXPERIMENTAL EVALUATION 

The performance of a deepfake detection algorithm is generally evaluated using the receiver
operating characteristic (ROC) curve and the area under the curve (AUC) score. The ROC is a probability
curve, while the AUC represents the degree of separability [40]-[47]. In other words, the ROC indicates how
well the model distinguishes the 0 and 1 classes across decision thresholds, as shown in Figure 6 [32], and
the AUC summarizes the model's ability to distinguish between fake and real videos [48]. Deep learning-based
detection methods require data for training and testing. As a result, the need for large-scale deepfake
video datasets is growing. Some current deepfake datasets are listed in Table 3. In addition, Figure 7
displays our evaluation of several existing deepfake datasets that vary in release year, number of data
samples, and total number of distinct individuals [49]-[51]. Six of the most effective state-of-the-art
deepfake detection techniques are compared in this paper, and their frame-level AUC scores on each of these
datasets are listed in Table 4. Moreover, Figure 8 depicts the ROC curves of each technique on the different
datasets: FWA, MESO-4, MESOINCEPTION-4, XCEPTION-C-23, XCEPTION-C-40, and DSP-FWA, shown in
Figures 8(a)-(f), respectively.
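As a concrete illustration of these two metrics, the following sketch computes ROC points and the AUC from per-frame detector scores using only the standard library. The labels and scores are made-up values, the function names `roc_points` and `auc` are hypothetical, and the code assumes that both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    labels: 1 for fake (positive class), 0 for real.
    scores: detector outputs, where a higher score means "more likely fake".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up example: three fake frames and three real frames.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(f"AUC = {auc(roc_points(labels, scores)):.3f}")  # AUC = 0.889
```

An AUC of 1.0 corresponds to a detector that ranks every fake frame above every real one, while 0.5 corresponds to random guessing.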
Figure 6. Probability of ROC curve

Table 3. Quantitative analysis of existing deepfake datasets

| Reference | Dataset | Real frames | Real videos | Deepfake frames | Deepfake videos | Date of release | Description |
|---|---|---|---|---|---|---|---|
| [1] | DF-TIMIT-LQ, DF-TIMIT-HQ | 34.00k | 320 | 34.00k | 320 | Dec 2018 | The Vid-TIMIT dataset was used to create 640 deepfake videos using Faceswap-GAN, resulting in the Deepfake-TIMIT videos. DF-TIMIT-HQ and DF-TIMIT-LQ are equal-sized subsets of the videos. |
| [40] | DFDC | 488.40k | 1,131 | 1,783.30k | 4,113 | Oct 2019 | The DFDC dataset consists of 4,113 deepfake videos based on 1,131 original videos of 66 persons of diverse genders, ages, and ethnicities. |
| [40] | FF-DF | 509.90k | 1,000 | 509.90k | 1,000 | Jan 2019 | The FaceForensics++ dataset contains 1,000 real YouTube videos and 1,000 synthetic ones generated with Faceswap. |
| [41] | UADFV | 17.30k | 49 | 17.30k | 49 | Nov 2018 | UADFV has a total of 98 videos, 49 of which are real and 49 of which are deepfakes. FakeApp and a DNN model are used to generate the deepfake videos. |
| [41] | DFD | 315.40k | 363 | 2,242.7k | 3,068 | Sep 2019 | The deepfake detection dataset (Google/Jigsaw) consists of 3,068 deepfake videos created from 363 original videos. |
| [42] | Celeb-DF | 225.40k | 590 | 2,116.80k | 5,639 | Nov 2019 | The Celeb-DF dataset contains 5,639 deepfake videos and 590 real videos. The standard frame rate of the videos is 30 frames per second, and the average video length is approximately 13 seconds. |

Table 4. Frame-level AUC scores

| Reference | Technique | UADFV (%) | DF-TIMIT-LQ (%) | DF-TIMIT-HQ (%) | FF-DF (%) | DFD (%) | DFDC (%) | Celeb-DF (%) |
|---|---|---|---|---|---|---|---|---|
| [43] | DSP-FWA | 97.70 | 99.90 | 99.70 | 93.00 | 81.10 | 75.50 | 64.60 |
| [44] | MESOINCEPTION-4 | 82.10 | 80.40 | 62.70 | 83.00 | 75.90 | 73.20 | 53.60 |
| [45] | FWA | 97.40 | 99.90 | 93.20 | 80.10 | 74.30 | 72.70 | 56.90 |
| [46] | XCEPTION-C-23 | 91.20 | 95.90 | 94.40 | 99.70 | 85.90 | 72.20 | 65.30 |
| [46] | XCEPTION-C-40 | 83.60 | 75.80 | 70.50 | 95.50 | 65.80 | 69.70 | 65.50 |
| [44] | MESO-4 | 84.30 | 87.80 | 68.40 | 84.70 | 76.00 | 75.30 | 54.80 |

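The scores in Table 4 can be summarized with a few lines of code. The sketch below takes the unweighted mean AUC of each technique across the seven dataset columns. This is only an illustrative summary under the assumption that all datasets count equally, not a metric reported in the compared studies. The variable names are hypothetical.

```python
# Frame-level AUC scores (%) per technique, in the column order of Table 4
# (the last entry of each list is the Celeb-DF column).
auc_scores = {
    "DSP-FWA": [97.70, 99.90, 99.70, 93.00, 81.10, 75.50, 64.60],
    "MESOINCEPTION-4": [82.10, 80.40, 62.70, 83.00, 75.90, 73.20, 53.60],
    "FWA": [97.40, 99.90, 93.20, 80.10, 74.30, 72.70, 56.90],
    "XCEPTION-C-23": [91.20, 95.90, 94.40, 99.70, 85.90, 72.20, 65.30],
    "XCEPTION-C-40": [83.60, 75.80, 70.50, 95.50, 65.80, 69.70, 65.50],
    "MESO-4": [84.30, 87.80, 68.40, 84.70, 76.00, 75.30, 54.80],
}

# Unweighted mean across datasets (an assumption made for illustration only).
mean_auc = {name: sum(s) / len(s) for name, s in auc_scores.items()}
ranking = sorted(mean_auc, key=mean_auc.get, reverse=True)

# Check whether every technique scores lowest on the last column (Celeb-DF).
hardest_is_last = all(min(s) == s[-1] for s in auc_scores.values())

for name in ranking:
    print(f"{name:<16} {mean_auc[name]:.2f}")
print("Lowest score always on Celeb-DF:", hardest_is_last)
```

Under this simple averaging, DSP-FWA ranks first, and every listed technique obtains its lowest score on Celeb-DF.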
4. CONCLUSION 


Deepfakes have eroded trust in media content because seeing is no longer believing. In addition to
causing distress and harm to the people they target, the disinformation and hate speech they propagate can
heighten political tensions and incite the population to violence or even war. This is especially important
now that the technology has become more accessible and deepfakes are easier to create and spread on social
media platforms. This survey provides an overview of deepfake creation and detection methods and discusses
challenges and trends. It is intended to help the artificial intelligence research community tackle
deepfakes.